Interval Comparison #5180
Conversation
I'm not sure about this; intervals cannot be accurately converted to a number of milliseconds, etc. Whilst the previous ordering might have been surprising, it is a total order that matches the equality relation; this PR is neither. Perhaps we might take a step back as to what you're hoping to achieve with this?
```diff
@@ -50,6 +50,8 @@ pub enum DataType {
     Int32,
     /// A signed 64-bit integer.
     Int64,
+    /// A signed 128-bit integer.
+    Int128,
```
I don't think we can unilaterally add array types to the arrow specification
Right, I agree that any order should match the equality relation, and there is a bug here. We will discuss with @berkaysynnada shortly.
Sure, thanks for offering to help. Basically, the issue we are facing is that the current total order results in wrong branches being taken during pruning operations in downstream use cases (like DataFusion). We are trying to arrive at a partial order that is compatible with equality and only returns true when a comparison certainly holds.
I'm not sure such a relation is possible, or at least it is not trivially obvious to me. Parquet, for example, does not produce interval statistics at all (although arrow-rs has historically been writing them, fixed by #5147) - https://github.com/apache/parquet-format/blob/066f9817332da32bdc6dc6dea833b6ee9c269934/LogicalTypes.md?plain=1#L537-L539 A couple of thoughts:
Yes, this was surprising to us too. The issue is that there is nothing barring a user from having a column whose type is an interval and writing an expression involving it. In such cases, range analysis gets things wrong because of the ordering issue.
Agreed, definitely not trivial. We will try to iterate on this to see if we can arrive at such a relation -- and if we fail, we will learn something and can come up with a plan B. @berkaysynnada will post an update here today after we think about this a little more.
Perhaps it could just bail out or something... If it isn't an expected use case, perhaps we can just fall back to not doing range analysis - I have to confess to not having a good grasp of when or how frequently such a situation would arise.
(force-pushed from bc4967e to 9f9382e)
I've revised the implementation, and the comparison semantics are now as follows: the interval values on both sides of the comparison work as if they reference a common timestamp. If comparing these intervals with respect to this reference gives a definite answer, as with 1 month and 1 month + 1 day, that answer is given as the result. However, in indefinite cases, like 1 month and 30 days, the result will be false for both greater-than and less-than comparisons. This means that comparisons such as "1 month < 1 month + 1 day" now return true. The current status of this PR includes:
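A minimal sketch of the semantics described above (hypothetical names and struct; not the PR's actual code): compare two intervals through their *difference*, mapped to the range of real durations it may span (treating a month as 28 to 31 days), and return true only when that whole range lies on one side of zero.

```rust
// Hypothetical sketch of the proposed partial-order semantics; the real
// arrow-rs interval representation differs in detail.
const NANOS_PER_DAY: i128 = 86_400_000_000_000;
const MIN_DAYS_PER_MONTH: i128 = 28;
const MAX_DAYS_PER_MONTH: i128 = 31;

#[derive(Clone, Copy)]
pub struct MonthDayNano {
    pub months: i32,
    pub days: i32,
    pub nanos: i64,
}

/// Inclusive range of nanoseconds the interval may represent,
/// given that a month can span 28..=31 days.
fn range_nanos(iv: MonthDayNano) -> (i128, i128) {
    let fixed = iv.days as i128 * NANOS_PER_DAY + iv.nanos as i128;
    let a = iv.months as i128 * MIN_DAYS_PER_MONTH * NANOS_PER_DAY + fixed;
    let b = iv.months as i128 * MAX_DAYS_PER_MONTH * NANOS_PER_DAY + fixed;
    (a.min(b), a.max(b))
}

/// True only if `a < b` holds for every possible realization of both
/// intervals relative to a common reference timestamp.
pub fn definitely_lt(a: MonthDayNano, b: MonthDayNano) -> bool {
    let diff = MonthDayNano {
        months: b.months - a.months,
        days: b.days - a.days,
        nanos: b.nanos - a.nanos,
    };
    // Definite only if the smallest possible difference is still positive.
    range_nanos(diff).0 > 0
}

fn main() {
    let one_month = MonthDayNano { months: 1, days: 0, nanos: 0 };
    let month_and_day = MonthDayNano { months: 1, days: 1, nanos: 0 };
    let thirty_days = MonthDayNano { months: 0, days: 30, nanos: 0 };

    // Definite: 1 month < 1 month + 1 day.
    assert!(definitely_lt(one_month, month_and_day));
    // Indefinite: 1 month vs 30 days is false in both directions.
    assert!(!definitely_lt(one_month, thirty_days));
    assert!(!definitely_lt(thirty_days, one_month));
}
```

Under these sketched semantics, the indeterminate pair returns false for both `<` and `>`, which is the behavior debated in the rest of the thread.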
Thanks @berkaysynnada. @tustvold, what do you think about these semantics? They seem to improve the behavior without breaking any use case, but let's try to poke holes and make sure.
Whilst I appreciate you working on this, I feel I wasn't clear enough before:
From my perspective, this seems like it is trying to frame range analysis as an ordering relation, which seems somewhat contrived. Perhaps this logic could be implemented within the range analysis framework of DataFusion, as opposed to here?
AFAICT the way the proposed ordering relation works will not break any sorts. Since indeterminate cases return false in both directions, it will act like a pseudo total order. This is what I'm trying to nail down by discussing with you -- if it breaks sorts, I agree it is a no-go.
At this point, we are trying to get the code to compile so we can iterate to find the right semantics. The purpose here is not to rush a merge or make a de-facto change to the arrow spec. The code may very well move elsewhere if this is not the right place at this time. My personal take is that returning clearly wrong comparison results for interval datatypes should be fixed at this level. The library claims to support these datatypes, after all, and performs such comparisons without any error or warning, so silently returning arbitrary (relative to real-world interval semantics) comparison results is simply wrong IMHO. However, it doesn't really matter at this point -- we can always revisit such decisions over time. Let's try to find a good solution together first, and we can place the code somewhere else to get things moving.
Currently the following is always true regardless of input
Or to put it differently, the arrow library uses a consistent total ordering relation across its kernels, including things like the row format. I agree the ordering is surprising and may violate people's expectations, but this is a general problem for intervals, where even basic arithmetic axioms don't hold, e.g.
This will indeed be false with the proposed semantics. I understand that you want to preserve such invariants, though I'd say it's arguable whether users want an invariant that the underlying datatype doesn't really admit under its natural ordering semantics. The price to pay (unnatural comparisons) is quite heavy. There is unfortunately no perfect solution here; the interval data type is too weird 🙁
Such basic arithmetic axioms do not hold for floating-point values either, but they still have sensible ordering semantics (comparisons coincide with their natural meaning for non-NaN cases, and indeterminate comparisons involving NaNs return false in all directions). So if you are pointing out yet another weirdness of the interval type, I agree with you -- but these points are not logically related and do not support/normalize each other. Let's fix this in DataFusion in the short term, and we can revisit an upstream solution later when we have more clarity. Maybe you or we will have even better ideas/semantics over time.
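For reference, the NaN behavior described above is exactly how Rust's built-in float comparison operators work: any ordered comparison involving NaN is false, while non-NaN values compare by their natural order.

```rust
fn main() {
    let nan = f64::NAN;
    // Ordered comparisons involving NaN return false in every direction,
    // and NaN is not even equal to itself.
    assert!(!(nan < 1.0));
    assert!(!(nan > 1.0));
    assert!(nan != nan);
    // Non-NaN values keep their natural ordering.
    assert!(1.0_f64 < 2.0);
}
```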
FWIW we use IEEE total ordering for floats, which don't behave as you describe - https://docs.rs/arrow-ord/latest/arrow_ord/cmp/fn.gt.html |
It actually does behave the way I describe. Anyway, it doesn't matter; we will do the fix downstream. Thanks for the discussions, which facilitated improved semantics for this.
We do indeed, as documented in the link |
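The IEEE 754 total ordering referred to in the linked docs is exposed in Rust as `f64::total_cmp`, which, unlike the partial `<`/`>` operators, makes NaN comparable (sorting after positive infinity) rather than incomparable:

```rust
use std::cmp::Ordering;

fn main() {
    // total_cmp yields a total order, so every pair of values is
    // comparable and sorting never needs to special-case NaN.
    let mut v = vec![f64::NAN, 1.0, f64::NEG_INFINITY, -0.0, 0.0];
    v.sort_by(|a, b| a.total_cmp(b));
    // NaN (with a positive sign bit) sorts after everything else.
    assert!(v.last().unwrap().is_nan());
    assert_eq!(f64::NAN.total_cmp(&f64::INFINITY), Ordering::Greater);
    // -0.0 sorts strictly before +0.0 in the total order.
    assert_eq!((-0.0f64).total_cmp(&0.0), Ordering::Less);
}
```

This mirrors the trade-off under discussion: the kernels pick a consistent total order over the "natural" partial one.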
I know this PR is closed, but I would like to offer two comments:
I would like to (and I will) add a test that documents this assumption/feature in code so that we don't end up in a situation like this again, where a proposed change breaks a non-obvious invariant.
Thank you, Andrew. I think the broader invariant is that we order values consistently across kernels, so things like
Right -- I was looking for a counter-example that would sort differently under the two different semantics. Maybe something like the following. Sort order on master (even though "logically" one might expect
But if you changed the comparison to the possibly more intuitive ordering:
then the order doesn't match the order of the underlying i128 🤔
I filed #5192 with tests and documentation explaining the current behavior and my understanding of the rationale. |
After further study of this matter, I think it could make sense to add new comparison kernels that are specific to intervals and implement the partial-order interval comparisons as proposed in this PR. Something like:

```rust
let arr1 = IntervalMonthDayNanoArray::from(...);
let arr2 = IntervalMonthDayNanoArray::from(...);
// compare arr1 and arr2 using interval-specific logic / partial order
let res = interval_partial_lt(&arr1, &arr2);
```
Perhaps I might suggest filing a ticket to discuss what the desired semantics are, linking back to the one in DF, as the first step before jumping into code. In particular, I am curious what systems like postgres do. In general, I would like to encourage this pattern of working, as it can help avoid frustration and ensure all the necessary context is understood before work gets underway. Filing subtle and/or large PRs with no prior discussion is not setting things up for success.
Yes, this is an excellent idea -- I will do so. I agree that for items where the semantics may not be clear, it often helps to start with a ticket rather than a PR |
I have filed #5200 to discuss how to handle interval comparisons (if at all) |
Which issue does this PR close?
Closes #.
Rationale for this change
When intervals are compared, `IntervalDayTimeArray` and `IntervalMonthDayNanoArray` exhibit unexpected behavior. This occurs because they are treated as signed integers, but they require special handling since they consist of multiple contiguous time-unit blocks.

The first unexpected behavior is observed when the LSD parts of these intervals are negative numbers and the MSD's are 0; under these conditions, intervals with negative values are evaluated as if they are greater than those with positive values. This test also fails in a similar manner.

The second problem arises when comparing relative intervals, such as 1 month and 30 days. In such cases, we cannot arrive at an absolute result (e.g. the number of days in a month can change). Therefore, to stay on the safe side, we adopt the following semantics: a comparison returns true if it certainly holds; otherwise, it returns false. Going back to our example, we return false for both "1 month < 30 days" and "1 month > 30 days". It is impossible to impose a total order on intervals, but we can impose a consistent partial order with this logic.
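The first problem can be illustrated with a simplified sketch (a hypothetical packing for explanation only; the real arrow-rs layout differs in detail): when a negative low-order field is packed into a wider integer, its sign-extension bits read as a large positive magnitude under raw signed comparison.

```rust
// Hypothetical simplified layout: pack months|days|nanos into one i128
// and compare the result as a plain signed integer.
fn pack(months: i32, days: i32, nanos: i64) -> i128 {
    ((months as i128) << 96)
        | (((days as u32) as i128) << 64)
        | ((nanos as u64) as i128)
}

fn main() {
    // months = 0, days = 0, nanos = -1: the sign bits of the nanosecond
    // field land in the low 64 bits of a positive i128, so "-1 ns"
    // compares *greater* than "+1 ns" under raw integer comparison.
    assert!(pack(0, 0, -1) > pack(0, 0, 1));
    // Ordinary non-negative values still compare as expected.
    assert!(pack(0, 0, 1) > pack(0, 0, 0));
}
```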
What changes are included in this PR?

`IntervalDayTimeArray` and `IntervalMonthDayNanoArray` are separately handled before passing to `apply()` in `compare_op()`. They are represented in milliseconds and nanoseconds, respectively, as the maximum or minimum value that the interval may correspond to.

As we need an exact representation of `IntervalMonthDayNano` in nanoseconds to prevent overflow cases, a variant for a 128-bit signed integer is added to `DataType`. Necessary extensions are implemented accordingly.

Are there any user-facing changes?